Statistical Programming with R

We use the following packages:

library(MASS)     # cats data
library(psych)    # describe function
library(dplyr)    # data manipulation 
library(magrittr) # pipes
library(ggplot2)  # visualization
library(mice)     # boys data
library(haven)    # read SPSS/SAS/STATA data
library(caret)    # cross-validation

Pipes

This is a pipe:

boys <- 
  read_sav("boys.sav") %>%
  head()

It effectively replaces head(read_sav("boys.sav")).

Why are pipes useful?

Let’s assume that we want to load data, change a variable, filter cases and select columns. Without a pipe, this would look like

boys  <- read_sav("boys.sav")
boys2 <- transform(boys, hgt = hgt / 100)
boys3 <- filter(boys2, age > 15)
boys4 <- subset(boys3, select = c(hgt, wgt, bmi))

With the pipe:

boys <-
  read_sav("boys.sav") %>%
  transform(hgt = hgt/100) %>%
  filter(age > 15) %>%
  subset(select = c(hgt, wgt, bmi))

Benefit: a single object in memory that is easy to interpret

With pipes

Your code becomes more readable:

  • data operations are structured from left-to-right and not from in-to-out
  • nested function calls are avoided
  • local variables and copied objects are avoided
  • easy to add steps in the sequence

What do pipes do:

  • f(x) becomes x %>% f()
rnorm(10) %>% mean()
## [1] 0.03589266
  • f(x, y) becomes x %>% f(y)
boys[, 1:5] %>% cor(use = "pairwise.complete.obs")
##           age       hgt       wgt       bmi        hc
## age 1.0000000 0.9752568 0.9505762 0.6319678 0.8572431
## hgt 0.9752568 1.0000000 0.9428906 0.5999647 0.9123139
## wgt 0.9505762 0.9428906 1.0000000 0.7976402 0.8374706
## bmi 0.6319678 0.5999647 0.7976402 1.0000000 0.5912613
## hc  0.8572431 0.9123139 0.8374706 0.5912613 1.0000000
  • h(g(f(x))) becomes x %>% f %>% g %>% h
boys %>% subset(select = wgt) %>% na.omit() %>% max()
## [1] 117.4

More pipe stuff

The standard %>% pipe

boys %>% psych::describe(skew = FALSE)
##      vars   n   mean    sd   min    max  range   se
## age     1 748   9.16  6.89  0.04  21.18  21.14 0.25
## hgt     2 728 132.15 46.51 50.00 198.00 148.00 1.72
## wgt     3 744  37.15 26.03  3.14 117.40 114.26 0.95
## bmi     4 727  18.07  3.05 11.77  31.74  19.97 0.11
## hc      5 702  51.51  5.91 33.70  65.00  31.30 0.22
## gen*    6 245   3.12  1.58  1.00   5.00   4.00 0.10
## phb*    7 245   3.36  1.88  1.00   6.00   5.00 0.12
## tv      8 226  11.89  7.99  1.00  25.00  24.00 0.53
## reg*    9 745   3.02  1.14  1.00   5.00   4.00 0.04

The %$% pipe

boys %$% mean(age)
## [1] 9.158866

The role of . in a pipe

In a %>% b(arg1, arg2, arg3), a will become arg1. With . we can change this.

cats %>%
  plot(Hwt ~ Bwt, data = .)

VS

cats %$%
  plot(Hwt ~ Bwt)

The . can be used as a placeholder in the pipe.
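
As a sketch of what this makes possible, the dot can redirect the left-hand side into any argument position, not just the first. This toy example with base R's round() is mine, not from the slides:

```r
library(magrittr)

# When . appears as an argument, the left-hand side is no longer
# inserted as the first argument -- it goes where the dot is:
6 %>% round(pi, digits = .)   # same as round(pi, digits = 6)
## [1] 3.141593
```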

Performing a t-test in a pipe

cats %$%
  t.test(Hwt ~ Sex)
## 
##  Welch Two Sample t-test
## 
## data:  Hwt by Sex
## t = -6.5179, df = 140.61, p-value = 1.186e-09
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -2.763753 -1.477352
## sample estimates:
## mean in group F mean in group M 
##        9.202128       11.322680

is the same as

t.test(Hwt ~ Sex, data = cats)

Storing a t-test from a pipe

cats.test <- cats %$%
  t.test(Bwt ~ Sex)

cats.test
## 
##  Welch Two Sample t-test
## 
## data:  Bwt by Sex
## t = -8.7095, df = 136.84, p-value = 8.831e-15
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.6631268 -0.4177242
## sample estimates:
## mean in group F mean in group M 
##        2.359574        2.900000

Data visualization with ggplot2

The anscombe data

anscombe
##    x1 x2 x3 x4    y1   y2    y3    y4
## 1  10 10 10  8  8.04 9.14  7.46  6.58
## 2   8  8  8  8  6.95 8.14  6.77  5.76
## 3  13 13 13  8  7.58 8.74 12.74  7.71
## 4   9  9  9  8  8.81 8.77  7.11  8.84
## 5  11 11 11  8  8.33 9.26  7.81  8.47
## 6  14 14 14  8  9.96 8.10  8.84  7.04
## 7   6  6  6  8  7.24 6.13  6.08  5.25
## 8   4  4  4 19  4.26 3.10  5.39 12.50
## 9  12 12 12  8 10.84 9.13  8.15  5.56
## 10  7  7  7  8  4.82 7.26  6.42  7.91
## 11  5  5  5  8  5.68 4.74  5.73  6.89

Fitting a line

anscombe %>%
  ggplot(aes(x1, y1)) + 
  geom_point() + 
  geom_smooth(method = "lm")

Fitting a line

Why visualise?

  • We can process a lot of information quickly with our eyes
  • Plots give us information about
    • Distribution / shape
    • Irregularities
    • Assumptions
    • Intuitions
  • Summary statistics, correlations, parameters, model tests, p-values do not tell the whole story

ALWAYS plot your data!

Why visualise?

Source: Anscombe, F. J. (1973). “Graphs in Statistical Analysis”. American Statistician. 27 (1): 17–21.

Why visualise?

What is ggplot2?

Layered plotting based on the book The Grammar of Graphics by Leland Wilkinson.

With ggplot2 you

  1. provide the data
  2. define how to map variables to aesthetics
  3. state which geometric object to display
  4. (optional) edit the overall theme of the plot

ggplot2 then takes care of the details

An example: scatterplot

1: Provide the data

mice::boys %>%
  ggplot()

2: map variable to aesthetics

mice::boys %>%
  ggplot(aes(x = age, y = bmi))

3: state which geometric object to display

mice::boys %>%
  ggplot(aes(x = age, y = bmi)) +
  geom_point()

An example: scatterplot

Why this syntax?

Create the plot

gg <- 
  mice::boys %>%
  ggplot(aes(x = age, y = bmi)) +
  geom_point(col = "dark green")

Add another layer (smooth fit line)

gg <- gg + 
  geom_smooth(col = "dark blue")

Give it some labels and a nice look

gg <- gg + 
  labs(x = "Age", y = "BMI", title = "BMI trend for boys") +
  theme_minimal()

Why this syntax?

plot(gg)

Why this syntax?

Linear models

Notation

The mathematical formulation of the relationship between variables can be written as

\[ \mbox{observed}=\mbox{predicted}+\mbox{error} \]

or, in Greek notation, as \[y=\mu+\varepsilon\]

where

  • \(\mu\) (mean) is the part of the outcome that is explained by the model
  • \(\varepsilon\) (residual) is the part of the outcome that is not explained by the model
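
This decomposition can be verified directly in R; a minimal sketch using the built-in anscombe data (my example, not part of the slides):

```r
# observed = predicted + error, checked numerically
fit <- lm(y1 ~ x1, data = anscombe)
mu  <- fitted(fit)     # the part explained by the model
eps <- residuals(fit)  # the part not explained by the model
all.equal(unname(mu + eps), anscombe$y1)
## [1] TRUE
```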

Univariate expectation

Conditional expectation

Assumptions

The key assumptions

There are four key assumptions about the use of linear regression models.

In short, we assume

  • The outcome to have a linear relation with the predictors and the predictor relations to be additive.
    • the expected value for the outcome is a straight-line function of each predictor, given that the others are fixed.
    • the slope of each line does not depend on the values of the other predictors
    • the effects of the predictors on the expected value are additive

    \[ y = \alpha + \beta_1X_1 + \beta_2X_2 + \beta_3X_3 + \epsilon\]

  • The residuals are statistically independent
  • The residual variance is constant
    • across the expected values
    • across any of the predictors
  • The residuals are normally distributed with mean \(\mu_\epsilon = 0\)

A simple model

fit <- anscombe %$%
  lm(y1 ~ x1)
fit
## 
## Call:
## lm(formula = y1 ~ x1)
## 
## Coefficients:
## (Intercept)           x1  
##      3.0001       0.5001
fit2 <- anscombe %$%
  lm(y2 ~ x2)

Checking assumptions

  1. linearity
    • scatterplot \(y\) and \(x\)
    • include loess curve when in doubt
    • does a squared term improve fit?
  2. normality residuals
    • normal probability plots qqnorm()
    • if the sample is small: qqnorm with simulated errors, cf. rnorm(n, 0, s)
  3. constant error variance
    • residual plot
    • scale-location plot

Linearity

anscombe %>%
  ggplot(aes(x1, y1)) + 
  geom_point() + 
  geom_smooth(method = "loess", col = "blue") + 
  geom_smooth(method = "lm", col = "orange")

Linearity

The loess curve suggests slight non-linearity

Adding a squared term

anscombe %$%
  lm(y1 ~ x1 + I(x1^2)) %>%
  summary()
## 
## Call:
## lm(formula = y1 ~ x1 + I(x1^2))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.8704 -0.3481 -0.2456  0.7129  1.8072 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  0.75507    3.28815   0.230    0.824
## x1           1.06925    0.78982   1.354    0.213
## I(x1^2)     -0.03162    0.04336  -0.729    0.487
## 
## Residual standard error: 1.27 on 8 degrees of freedom
## Multiple R-squared:  0.6873, Adjusted R-squared:  0.6092 
## F-statistic: 8.793 on 2 and 8 DF,  p-value: 0.009558

Constant error variance?

par(mfrow = c(1, 2))
fit %>%
  plot(which = c(1, 3), cex = .6)

No constant error variance!

par(mfrow = c(1, 2))
boys %$%
  lm(bmi ~ age) %>%
  plot(which = c(1, 3), cex = .6)

Normality of errors

fit %>%
  plot(which = 2, cex = .6)

The QQplot shows some divergence from normality at the tails

Outliers, influence and robust regression

Outliers and influential cases

Leverage: see the fit line as a lever.

  • some points pull/push harder; they have more leverage

Standardized residuals:

  • Points with more leverage tend to be closer to the line
  • The line is fit so as to be closer to them
  • The residual standard deviation can differ at different points on \(X\) - even if the error standard deviation is constant.
  • Therefore we standardize the residuals so that they have constant variance (assuming homoscedasticity).

Cook’s distance: how far the predicted values would move if your model were fit without the data point in question.

  • it is a function of the leverage and standardized residual associated with each data point

Fine

High leverage, low residual

Low leverage, high residual

High leverage, high residual

Outliers and influential cases

Outliers are cases with large \(e_z\) (standardized residuals).

If the model is correct we expect:

  • 5% of \(|e_z|>1.96\)
  • 1% of \(|e_z|>2.58\)
  • 0% of \(|e_z|>3.3\)

Influential cases are cases with large influence on parameter estimates

  • cases with Cook’s Distance \(> 1\), or
  • cases with Cook’s Distance much larger than the rest
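
Both diagnostics are easy to extract; a sketch on the anscombe fit (the 4/n cutoff is a common rule of thumb I add here, not one from the slides):

```r
fit <- lm(y1 ~ x1, data = anscombe)
e_z <- rstandard(fit)       # standardized residuals
cd  <- cooks.distance(fit)  # Cook's distance per case
sum(abs(e_z) > 1.96)        # how many potential outliers?
which(cd > 4 / nobs(fit))   # flag cases by the 4/n rule of thumb
```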

Outliers and influential cases

par(mfrow = c(1, 2), cex = .6)
fit %>% plot(which = c(4, 5))

There are no cases with \(|e_z|>2\), so no outliers (right plot). There are no cases with Cook’s Distance \(>1\), but case 3 stands out

Model fit

A simple model

boys.fit <- 
  na.omit(boys) %$% # Extremely wasteful
  lm(age ~ reg)
boys.fit
## 
## Call:
## lm(formula = age ~ reg)
## 
## Coefficients:
## (Intercept)      regeast      regwest     regsouth      regcity  
##     14.7109      -0.8168      -0.7044      -0.6913      -0.6663
boys %>% na.omit() %$% aggregate(age, list(reg), mean)
##   Group.1        x
## 1   north 14.71094
## 2    east 13.89410
## 3    west 14.00657
## 4   south 14.01965
## 5    city 14.04460

Plotting the model

means <- boys %>% na.omit() %>% group_by(reg) %>% summarise(age = mean(age))
ggplot(na.omit(boys), aes(x = reg, y = age)) + 
  geom_point(color = "grey") + 
  geom_point(data = means, stat = "identity", size = 3)

Model parameters

boys.fit %>%
  summary()
## 
## Call:
## lm(formula = age ~ reg)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.8519 -2.5301  0.0254  2.2274  6.2614 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  14.7109     0.5660  25.993   <2e-16 ***
## regeast      -0.8168     0.7150  -1.142    0.255    
## regwest      -0.7044     0.6970  -1.011    0.313    
## regsouth     -0.6913     0.6970  -0.992    0.322    
## regcity      -0.6663     0.9038  -0.737    0.462    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.151 on 218 degrees of freedom
## Multiple R-squared:  0.006745,   Adjusted R-squared:  -0.01148 
## F-statistic: 0.3701 on 4 and 218 DF,  p-value: 0.8298

Scientific notation

If you have trouble reading scientific notation, 2e-16 means the following

\[2\text{e-16} = 2 \times 10^{-16} = 2 \times \left(\frac{1}{10}\right)^{16}\]

This indicates that the decimal point should be moved 16 places to the left:

\[2\text{e-16} = 0.0000000000000002\]
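
R can also print such a number in fixed notation; a quick check (my example, not from the slides):

```r
# Ask format() to suppress scientific notation
format(2e-16, scientific = FALSE)
## [1] "0.0000000000000002"
```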

AIC

Akaike’s An Information Criterion

boys.fit %>% 
  AIC()
## [1] 1151.684

What is AIC

AIC comes from information theory and can be used for model selection. The AIC estimates the information that is lost when a statistical model is used to approximate the process that generated the data. In other words: AIC measures the fit of the model to the data, penalized for model complexity.

  • The better the fit, the smaller the loss of information
  • AIC works on the log scale:
    • \(\text{log}(0) = -\infty\), \(\text{log}(1) = 0\), etc.
  • the lower the AIC, the better
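
For lm objects, AIC equals \(-2\log L + 2k\), where \(k\) counts the estimated parameters (including the residual variance). A sketch verifying this by hand (my example, not from the slides):

```r
fit <- lm(y1 ~ x1, data = anscombe)
ll  <- logLik(fit)
k   <- attr(ll, "df")   # number of parameters, incl. residual variance
all.equal(-2 * as.numeric(ll) + 2 * k, AIC(fit))
## [1] TRUE
```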

Model comparison

A new model

Let’s add predictor hgt to the model:

boys.fit2 <- 
  na.omit(boys) %$%
  lm(age ~ reg + hgt)

boys.fit %>% AIC()
## [1] 1151.684
boys.fit2 %>% AIC()
## [1] 836.3545

Another model

Let’s add wgt to the model

boys.fit3 <- 
  na.omit(boys) %$%
  lm(age ~ reg + hgt + wgt)

And another model

Let’s add wgt and the interaction between wgt and hgt to the model

boys.fit4 <- 
  na.omit(boys) %$%
  lm(age ~ reg + hgt * wgt)

is equivalent to

boys.fit4 <- 
  na.omit(boys) %$%
  lm(age ~ reg + hgt + wgt + hgt:wgt)
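
The equivalence of * and the expanded formula can be checked on any dataset; a sketch with the built-in anscombe data (my example, not from the slides):

```r
# '*' expands to both main effects plus their interaction
f1 <- lm(y1 ~ x1 * x2, data = anscombe)
f2 <- lm(y1 ~ x1 + x2 + x1:x2, data = anscombe)
all.equal(coef(f1), coef(f2))
## [1] TRUE
```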

Model comparison

AIC(boys.fit, boys.fit2, boys.fit3, boys.fit4)
##           df       AIC
## boys.fit   6 1151.6839
## boys.fit2  7  836.3545
## boys.fit3  8  822.4164
## boys.fit4  9  823.0386

And with anova()

anova(boys.fit, boys.fit2, boys.fit3, boys.fit4)
## Analysis of Variance Table
## 
## Model 1: age ~ reg
## Model 2: age ~ reg + hgt
## Model 3: age ~ reg + hgt + wgt
## Model 4: age ~ reg + hgt * wgt
##   Res.Df     RSS Df Sum of Sq        F    Pr(>F)    
## 1    218 2164.59                                    
## 2    217  521.64  1   1642.94 731.8311 < 2.2e-16 ***
## 3    216  485.66  1     35.98  16.0276 8.595e-05 ***
## 4    215  482.67  1      2.99   1.3324    0.2497    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Inspect boys.fit3

boys.fit3 %>% anova()
## Analysis of Variance Table
## 
## Response: age
##            Df  Sum Sq Mean Sq  F value    Pr(>F)    
## reg         4   14.70    3.67   1.6343    0.1667    
## hgt         1 1642.94 1642.94 730.7065 < 2.2e-16 ***
## wgt         1   35.98   35.98  16.0029 8.688e-05 ***
## Residuals 216  485.66    2.25                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Inspect boys.fit4

boys.fit4 %>% anova()
## Analysis of Variance Table
## 
## Response: age
##            Df  Sum Sq Mean Sq  F value    Pr(>F)    
## reg         4   14.70    3.67   1.6368    0.1661    
## hgt         1 1642.94 1642.94 731.8311 < 2.2e-16 ***
## wgt         1   35.98   35.98  16.0276 8.595e-05 ***
## hgt:wgt     1    2.99    2.99   1.3324    0.2497    
## Residuals 215  482.67    2.24                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

It seems that reg and the interaction hgt:wgt are redundant

Remove reg

boys.fit5 <- 
  na.omit(boys) %$%
  lm(age ~ hgt + wgt)

Let’s revisit the comparison

anova(boys.fit, boys.fit2, boys.fit3, boys.fit5)
## Analysis of Variance Table
## 
## Model 1: age ~ reg
## Model 2: age ~ reg + hgt
## Model 3: age ~ reg + hgt + wgt
## Model 4: age ~ hgt + wgt
##   Res.Df     RSS Df Sum of Sq        F    Pr(>F)    
## 1    218 2164.59                                    
## 2    217  521.64  1   1642.94 730.7065 < 2.2e-16 ***
## 3    216  485.66  1     35.98  16.0029 8.688e-05 ***
## 4    220  492.43 -4     -6.77   0.7529    0.5571    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The boys.fit5 model is preferred: it fits about as well as the previous model, with fewer parameters

Influence of cases

dfbeta() calculates the change in each coefficient when a case is deleted from the model; its scaled counterpart dfbetas() expresses that change in units of the coefficient’s standard error.

boys.fit5 %>%
  dfbeta() %>%
  head(n = 7)
##   (Intercept)           hgt           wgt
## 1  0.08023815 -0.0006010829  3.886307e-04
## 2 -0.16849516  0.0011153227 -4.813872e-04
## 3 -0.08258333  0.0005122980 -1.222825e-04
## 4 -0.04399686  0.0002530928 -8.689133e-06
## 5 -0.28701562  0.0021630263 -1.581283e-03
## 6 -0.06116123  0.0002818449  1.652042e-04
## 7 -0.04791078  0.0002228673  1.274280e-04
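
A sketch contrasting the raw and the scaled version on the anscombe fit (my example, not from the slides):

```r
fit <- lm(y1 ~ x1, data = anscombe)
head(dfbeta(fit), 3)    # raw change in coefficients per deleted case
head(dfbetas(fit), 3)   # the same change, in standard-error units
```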

Prediction

Fitted values

Let’s use the simpler anscombe data example

fit <- anscombe %$% lm(y1 ~ x1)

y_hat <- 
  fit %>%
  fitted.values()

The residual (observed minus fitted) is then calculated as

anscombe$y1 - y_hat
##           1           2           3           4           5           6 
##  0.03900000 -0.05081818 -1.92127273  1.30909091 -0.17109091 -0.04136364 
##           7           8           9          10          11 
##  1.23936364 -0.74045455  1.83881818 -1.68072727  0.17945455

Predict new values

If we introduce new values for the predictor x1, we can generate predicted values from the model

new.x1 <- data.frame(x1 = 1:20)
fit %>% predict(newdata = new.x1)
##         1         2         3         4         5         6         7 
##  3.500182  4.000273  4.500364  5.000455  5.500545  6.000636  6.500727 
##         8         9        10        11        12        13        14 
##  7.000818  7.500909  8.001000  8.501091  9.001182  9.501273 10.001364 
##        15        16        17        18        19        20 
## 10.501455 11.001545 11.501636 12.001727 12.501818 13.001909

Prediction intervals

fit %>% predict(interval = "prediction")
##          fit      lwr       upr
## 1   8.001000 5.067072 10.934928
## 2   7.000818 4.066890  9.934747
## 3   9.501273 6.390801 12.611745
## 4   7.500909 4.579129 10.422689
## 5   8.501091 5.531014 11.471168
## 6  10.001364 6.789620 13.213107
## 7   6.000636 2.971271  9.030002
## 8   5.000455 1.788711  8.212198
## 9   9.001182 5.971816 12.030547
## 10  6.500727 3.530650  9.470804
## 11  5.500545 2.390073  8.611017

A prediction interval reflects the uncertainty around a single value. The confidence interval reflects the uncertainty around the mean prediction values.
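
The difference is visible at any single predictor value: the prediction interval is always at least as wide as the confidence interval (a sketch of mine, not from the slides):

```r
fit    <- lm(y1 ~ x1, data = anscombe)
new    <- data.frame(x1 = 10)
conf_i <- predict(fit, newdata = new, interval = "confidence")
pred_i <- predict(fit, newdata = new, interval = "prediction")
# width of the prediction interval vs. the confidence interval
unname((pred_i[, "upr"] - pred_i[, "lwr"]) > (conf_i[, "upr"] - conf_i[, "lwr"]))
## [1] TRUE
```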

Assessing predictive accuracy

K-fold cross-validation

  • Divide the sample into \(k\) mutually exclusive subsets (folds)
  • For each fold \(j\in\{1,\dots,k\}\):

    1. fit the model to all folds except fold \(j\) (the training set)
    2. obtain predictions for fold \(j\) (the test set)
    3. compute the residual variance (MSE) for test fold \(j\)
  • Compare the cross-validated MSE with the in-sample MSE
  • A small difference suggests good predictive accuracy
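
These steps can be hand-rolled in a few lines of base R; a sketch on the anscombe data (the fold assignment and seed are illustrative choices of mine, not from the slides):

```r
set.seed(123)
k     <- 3
folds <- sample(rep(1:k, length.out = nrow(anscombe)))  # random fold labels
mse   <- numeric(k)
for (j in 1:k) {
  train  <- anscombe[folds != j, ]
  test   <- anscombe[folds == j, ]
  m      <- lm(y1 ~ x1, data = train)   # fit on the training folds
  pred   <- predict(m, newdata = test)  # predict the held-out fold
  mse[j] <- mean((test$y1 - pred)^2)    # MSE on the held-out fold
}
mean(mse)  # cross-validated MSE
```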

The original model

fit %>% summary()
## 
## Call:
## lm(formula = y1 ~ x1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.92127 -0.45577 -0.04136  0.70941  1.83882 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   3.0001     1.1247   2.667  0.02573 * 
## x1            0.5001     0.1179   4.241  0.00217 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.237 on 9 degrees of freedom
## Multiple R-squared:  0.6665, Adjusted R-squared:  0.6295 
## F-statistic: 17.99 on 1 and 9 DF,  p-value: 0.00217

K-fold cross-validation anscombe data

set.seed(123)
train.control <- trainControl(method = "cv", number = 3)
model <- train(y1 ~ x1, data = anscombe, method = "lm",
               trControl = train.control)
model
## Linear Regression 
## 
## 11 samples
##  1 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (3 fold) 
## Summary of sample sizes: 7, 8, 7 
## Resampling results:
## 
##   RMSE      Rsquared   MAE     
##   1.401633  0.7152385  1.152423
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE

Another approach

library(lmvar)
fit <- anscombe %$% lm(y1 ~ x1, y = TRUE, x = TRUE)
lmvar::cv.lm(fit, k = 3, seed = 123)
## Mean absolute error        :  1.188572 
## Sample standard deviation  :  0.5803123 
## 
## Mean squared error         :  2.143293 
## Sample standard deviation  :  1.37752 
## 
## Root mean squared error    :  1.402769 
## Sample standard deviation  :  0.5131254

K-fold cross-validation boys data

set.seed(123)
train.control <- trainControl(method = "cv", number = 3)
model <- train(age ~ hgt + wgt, data = na.omit(boys), method = "lm",
               trControl = train.control)
model
## Linear Regression 
## 
## 223 samples
##   2 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (3 fold) 
## Summary of sample sizes: 150, 149, 147 
## Resampling results:
## 
##   RMSE      Rsquared   MAE     
##   1.484762  0.7751937  1.153696
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE

Another approach

fit <- boys %$% lm(age ~ hgt + wgt, y = TRUE, x = TRUE)
lmvar::cv.lm(fit, k = 3)
## Mean absolute error        :  1.088759 
## Sample standard deviation  :  0.06427564 
## 
## Mean squared error         :  1.942275 
## Sample standard deviation  :  0.2396558 
## 
## Root mean squared error    :  1.391816 
## Sample standard deviation  :  0.08767143

How many cases are used?

na.omit(boys) %$%
  lm(age ~ hgt + wgt) %>%
  nobs()
## [1] 223

If we had not used na.omit()

boys %$%
  lm(age ~ hgt + wgt) %>%
  nobs()
## [1] 727

Some other modeling functions in R

  • lm(): linear modeling
  • glm(): generalized linear modeling
  • gamlss::gamlss: generalized additive models (for location, scale and shape)
  • lme4::lmer: linear mixed effect models
  • lme4::glmer: generalized linear mixed effect models
  • lme4::nlmer: non-linear mixed effect models